{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning Prognostics for Turbofan Engine Degradation Dataset\n",
    "\n",
    "Information about the problem is at http://ti.arc.nasa.gov/tech/dash/pcoe/prognostic-data-repository/publications/#turbofan and the original data is at http://ti.arc.nasa.gov/tech/dash/pcoe/prognostic-data-repository/#turbofan\n",
    "\n",
    "The data was originally generated using the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) tool.\n",
    "\n",
    "The approach used in the turbofan engine degradation dataset was then used in the PHM08 challenge. Information about other research on the C-MAPSS data is available at https://www.phmsociety.org/sites/phmsociety.org/files/phm_submission/2014/phmc_14_063.pdf\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import h2o\n",
    "from h2o.estimators.glm import H2OGeneralizedLinearEstimator\n",
    "from h2o.estimators.gbm import H2OGradientBoostingEstimator\n",
    "from h2o.utils.shared_utils import _locate\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import pykalman as pyk\n",
    "\n",
    "\n",
    "sns.set()\n",
    "doGridSearch = True\n",
    "doKalmanSmoothing = False #unrelated to h2o, set true for demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Input files don't have column names\n",
    "dependent_vars = ['RemainingUsefulLife']\n",
    "index_columns_names =  [\"UnitNumber\",\"Cycle\"]\n",
    "operational_settings_columns_names = [\"OpSet\"+str(i) for i in range(1,4)]\n",
    "sensor_measure_columns_names =[\"SensorMeasure\"+str(i) for i in range(1,22)]\n",
    "input_file_column_names = index_columns_names + operational_settings_columns_names + sensor_measure_columns_names\n",
    "\n",
    "# And we are going to add these columns\n",
    "kalman_smoothed_mean_columns_names =[\"SensorMeasureKalmanMean\"+str(i) for i in range(1,22)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read in the raw files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train = pd.read_csv(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/train_FD001.txt\", sep=r\"\\s+\", header=None,\n",
    "                   names=input_file_column_names, engine='python')\n",
    "test  = pd.read_csv(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/test_FD001.txt\",  sep=r\"\\s+\", header=None,\n",
    "                   names=input_file_column_names, engine='python')\n",
    "test_rul = pd.read_csv(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/RUL_FD001.txt\", header=None, names=['RemainingUsefulLife'])\n",
    "test_rul.index += 1  # set the index to be the unit number in the test data set\n",
    "test_rul.index.name = \"UnitNumber\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate Remaining Useful Life in T-minus notation for the training data\n",
    "This puts all data on the same basis for supervised training"
   ]
  },
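  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the T-minus convention computed in the next cell (the symbols $T_u$ and $\\mathrm{RUL}_u(t)$ are notation introduced here for illustration, not taken from the dataset documentation): if $T_u$ is the largest cycle number recorded for unit $u$, then the label for the row at cycle $t$ is\n",
    "\n",
    "$$\\mathrm{RUL}_u(t) = -\\left(T_u - t\\right) = t - T_u \\le 0.$$\n",
    "\n",
    "For a hypothetical unit whose last recorded cycle is $T_u = 200$, the row at cycle $t = 150$ gets $\\mathrm{RUL}_u(150) = -50$, and the final row gets $0$."
   ]
  },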
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Calculate the remaining useful life for each training sample based on last measurement being zero remaining\n",
    "grouped_train = train.groupby('UnitNumber', as_index=False)\n",
    "useful_life_train = grouped_train.agg({'Cycle' : np.max })\n",
    "useful_life_train.rename(columns={'Cycle': 'UsefulLife'}, inplace=True)\n",
    "train_wfeatures = pd.merge(train, useful_life_train, on=\"UnitNumber\")\n",
    "train_wfeatures[\"RemainingUsefulLife\"] = -(train_wfeatures.UsefulLife - train_wfeatures.Cycle)\n",
    "train_wfeatures.drop('UsefulLife', axis=1, inplace=True)\n",
    "\n",
    "grouped_test = test.groupby('UnitNumber', as_index=False)\n",
    "useful_life_test = grouped_test.agg({'Cycle' : np.max })\n",
    "useful_life_test.rename(columns={'Cycle': 'UsefulLife'}, inplace=True)\n",
    "test_wfeatures = pd.merge(test, useful_life_test, on=\"UnitNumber\")\n",
    "test_wfeatures[\"RemainingUsefulLife\"] = -(test_wfeatures.UsefulLife - test_wfeatures.Cycle)\n",
    "test_wfeatures.drop('UsefulLife', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis\n",
    "\n",
    "Look at how the sensor measures evolve over time (first column) as well as how they relate to each other for a subset of the units.\n",
    "\n",
    "The sensors plotted are the three most important and the two least important sensor features according to H2O's GBM, which is trained later in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sns.set_context(\"notebook\", font_scale=1.5)\n",
    "p = sns.pairplot(train_wfeatures.query('UnitNumber < 10'),\n",
    "                 vars=[\"RemainingUsefulLife\", \"SensorMeasure4\", \"SensorMeasure3\",\n",
    "                       \"SensorMeasure9\", \"SensorMeasure8\", \"SensorMeasure13\"], size=10,\n",
    "                 hue=\"UnitNumber\", palette=sns.color_palette(\"husl\", 9));\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Signal processing using Kalman smoothing filter\n",
    "Kalman filter parameters are estimated with the EM algorithm, and those parameters are then used to smooth the signal data.\n",
    "\n",
    "This is applied separately to each unit, in both the training and test sets."
   ]
  },
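  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to read this step, assuming the standard linear-Gaussian state-space model that Kalman smoothers are built on (the symbols below are generic textbook notation, not names from `pykalman`):\n",
    "\n",
    "$$x_{t+1} = A x_t + w_t, \\quad w_t \\sim \\mathcal{N}(0, Q), \\qquad z_t = C x_t + v_t, \\quad v_t \\sim \\mathcal{N}(0, R).$$\n",
    "\n",
    "Here $z_t$ is the vector of 21 sensor measures at cycle $t$ and $x_t$ is the hidden state. EM fits model parameters such as the noise covariances $Q$ and $R$ to one unit's series $z_{1:T}$, and the smoother then returns $\\hat{x}_t = \\mathbb{E}[x_t \\mid z_{1:T}]$, which uses the whole series rather than only the measurements up to cycle $t$. These smoothed means become the `SensorMeasureKalmanMean` columns."
   ]
  },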
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "kalman_smoothed_mean_columns_names =[\"SensorMeasureKalmanMean\"+str(i) for i in range(1,22)]\n",
    "\n",
    "def calcSmooth(measures):\n",
    "    kf = pyk.KalmanFilter(initial_state_mean=measures[0], n_dim_obs=measures.shape[1])\n",
    "    (smoothed_state_means, smoothed_state_covariances) = kf.em(measures).smooth(measures)\n",
    "    return smoothed_state_means\n",
    "\n",
    "def filterEachUnit(df):\n",
    "    dfout = df.copy()\n",
    "\n",
    "    for newcol in kalman_smoothed_mean_columns_names:\n",
    "        dfout[newcol] = np.nan\n",
    "\n",
    "    for unit in dfout.UnitNumber.unique():\n",
    "        sys.stdout.write('\\rProcessing Unit: %d' % unit)\n",
    "        sys.stdout.flush()\n",
    "        unitmeasures = dfout[dfout.UnitNumber == unit][sensor_measure_columns_names]\n",
    "        smoothed_state_means = calcSmooth( np.asarray( unitmeasures ) )\n",
    "        dfout.loc[dfout.UnitNumber == unit, kalman_smoothed_mean_columns_names] = smoothed_state_means\n",
    "        sys.stdout.write('\\rProcessing Unit: %d' % unit)\n",
    "        sys.stdout.flush()\n",
    "    sys.stdout.write('\\rFinished\\n')\n",
    "    sys.stdout.flush()\n",
    "\n",
    "    return dfout   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Output the results to files\n",
    "Saving the results means the preprocessing only has to be done once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get picky about the order of output columns\n",
    "test_output_cols = index_columns_names + operational_settings_columns_names + sensor_measure_columns_names + \\\n",
    "                   kalman_smoothed_mean_columns_names\n",
    "train_output_cols = test_output_cols + dependent_vars\n",
    "\n",
    "if doKalmanSmoothing:\n",
    "    train_wkalman = filterEachUnit(train_wfeatures)\n",
    "    test_wkalman = filterEachUnit(test_wfeatures)\n",
    "\n",
    "    train_output = train_wkalman[train_output_cols]\n",
    "    test_output = test_wkalman[test_output_cols]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Output the files, so we don't have to do the preprocessing again.\n",
    "if doKalmanSmoothing:\n",
    "    train_output.to_csv(\"train_FD001_preprocessed.csv\", index=False)\n",
    "    test_output.to_csv(\"test_FD001_preprocessed.csv\", index=False)\n",
    "    test_rul.to_csv(\"rul_FD001_preprocessed.csv\", index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start up H2O"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td>H2O cluster uptime: </td>\n",
       "<td>6 hours 33 minutes 13 seconds 411 milliseconds </td></tr>\n",
       "<tr><td>H2O cluster version: </td>\n",
       "<td>3.5.0.99999</td></tr>\n",
       "<tr><td>H2O cluster name: </td>\n",
       "<td>Kevin</td></tr>\n",
       "<tr><td>H2O cluster total nodes: </td>\n",
       "<td>1</td></tr>\n",
       "<tr><td>H2O cluster total memory: </td>\n",
       "<td>3.54 GB</td></tr>\n",
       "<tr><td>H2O cluster total cores: </td>\n",
       "<td>8</td></tr>\n",
       "<tr><td>H2O cluster allowed cores: </td>\n",
       "<td>8</td></tr>\n",
       "<tr><td>H2O cluster healthy: </td>\n",
       "<td>True</td></tr>\n",
       "<tr><td>H2O Connection ip: </td>\n",
       "<td>127.0.0.1</td></tr>\n",
       "<tr><td>H2O Connection port: </td>\n",
       "<td>54321</td></tr></table></div>"
      ],
      "text/plain": [
       "--------------------------  ----------------------------------------------\n",
       "H2O cluster uptime:         6 hours 33 minutes 13 seconds 411 milliseconds\n",
       "H2O cluster version:        3.5.0.99999\n",
       "H2O cluster name:           Kevin\n",
       "H2O cluster total nodes:    1\n",
       "H2O cluster total memory:   3.54 GB\n",
       "H2O cluster total cores:    8\n",
       "H2O cluster allowed cores:  8\n",
       "H2O cluster healthy:        True\n",
       "H2O Connection ip:          127.0.0.1\n",
       "H2O Connection port:        54321\n",
       "--------------------------  ----------------------------------------------"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "h2o.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load training and final test data into H2O"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Parse Progress: [##################################################] 100%\n",
      "Imported http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/train_FD001_preprocessed.csv. Parsed 20,631 rows and 48 cols\n",
      "\n",
      "Parse Progress: [##################################################] 100%\n",
      "Imported http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/test_FD001_preprocessed.csv. Parsed 13,096 rows and 47 cols\n"
     ]
    }
   ],
   "source": [
    "#Pull Kalman-smoothed data if generated locally, or source from AWS\n",
    "if doKalmanSmoothing:\n",
    "    train_hex = h2o.import_file(_locate(\"train_FD001_preprocessed.csv\"))\n",
    "    test_hex = h2o.import_file(_locate(\"test_FD001_preprocessed.csv\"))\n",
    "else:\n",
    "    train_hex = h2o.import_file(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/train_FD001_preprocessed.csv\")\n",
    "    test_hex = h2o.import_file(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/test_FD001_preprocessed.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up independent and dependent features\n",
    "\n",
    "Use the operational settings and Kalman-smoothed mean states as the independent features.\n",
    "\n",
    "Set up a fold column so that each cross-validation model is trained on 90 units and validated on the remaining 10 units. This creates a 10-fold cross-validation. The cross-validation models are then used to create an ensemble model for predictions."
   ]
  },
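  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fold assignment in the next cell can be written as follows, taking the 100 units implied by the 90/10 split above and using $f(u)$ as illustrative notation for the fold of unit $u$:\n",
    "\n",
    "$$f(u) = u \\bmod 10, \\qquad u \\in \\lbrace 1, \\dots, 100 \\rbrace, \\qquad f(u) \\in \\lbrace 0, \\dots, 9 \\rbrace.$$\n",
    "\n",
    "Every row of a unit shares that unit's fold. For example, units 7, 17, ..., 97 are all held out together in fold 7, so no engine contributes rows to both the training and validation side of a split."
   ]
  },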
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "xCols= operational_settings_columns_names + kalman_smoothed_mean_columns_names\n",
    "yCol = dependent_vars\n",
    "\n",
    "foldCol = \"UnitNumberMod10\"\n",
    "train_hex[foldCol] = train_hex[\"UnitNumber\"] % 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train a series of GLM Models using Grid Search over $\\alpha$ and $\\lambda$"
   ]
  },
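  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the elastic-net penalized least-squares objective in its usual textbook form is sketched below. The exact scaling H2O applies internally is an assumption here and may differ:\n",
    "\n",
    "$$\\min_{\\beta_0, \\beta} \\; \\frac{1}{2N} \\sum_{i=1}^{N} \\left(y_i - \\beta_0 - x_i^{\\top}\\beta\\right)^2 + \\lambda \\left( \\alpha \\lVert\\beta\\rVert_1 + \\frac{1-\\alpha}{2} \\lVert\\beta\\rVert_2^2 \\right)$$\n",
    "\n",
    "With $\\alpha = 1$ the penalty is pure lasso, with $\\alpha = 0$ it is pure ridge, and $\\lambda$ sets the overall strength. The default grid in the next cell crosses $\\alpha \\in \\lbrace 0, 0.5, 1 \\rbrace$ with $\\lambda \\in \\lbrace 10^{-3}, 10^{-2}, 10^{-1}, 10^{0} \\rbrace$, giving $3 \\times 4 = 12$ models."
   ]
  },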
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n",
      "\n",
      "glm Model Build Progress: [##################################################] 100%\n"
     ]
    }
   ],
   "source": [
    "def trainGLM(x, y, fold_column, training_frame, alpha=0.5, penalty=1e-5):\n",
    "    model = H2OGeneralizedLinearEstimator(family = \"gaussian\",alpha = [alpha], Lambda = [penalty])\n",
    "    model.train(x=x, y=y, training_frame=training_frame, fold_column=fold_column)\n",
    "    return model\n",
    "\n",
    "def gridSearchGLM(x, y, fold_column, training_frame, alphas = [0,0.5,1], penalties=np.logspace(-3,0,num=4)):\n",
    "    results = []\n",
    "    for alpha in alphas:\n",
    "        for penalty in penalties:\n",
    "            results.append( trainGLM(x, y, fold_column, training_frame, alpha, penalty) )\n",
    "    return results\n",
    "\n",
    "if doGridSearch:\n",
    "    glmModels = gridSearchGLM(xCols, yCol, foldCol, train_hex)\n",
    "else:\n",
    "    # this is used to speed up the demonstration by just building the single model previously found\n",
    "    glmModels = [ trainGLM(xCols, yCol, foldCol, train_hex, alpha=1, penalty=0.01 )]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the 'best' model\n",
    "\n",
    "Use the model with the lowest MSE on the cross-validation data.\n",
    "\n",
    "This is a reasonable substitute for using the final scoring method."
   ]
  },
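  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The selection rule can be stated as follows, where $\\hat{y}_{m,i}^{(-f(i))}$ is illustrative notation for model $m$'s prediction on row $i$ when row $i$'s fold is held out. This sketch assumes the cross-validation MSE is computed over the pooled held-out predictions:\n",
    "\n",
    "$$\\mathrm{MSE}_{\\mathrm{cv}}(m) = \\frac{1}{N}\\sum_{i=1}^{N}\\left(y_i - \\hat{y}_{m,i}^{(-f(i))}\\right)^2, \\qquad m^{*} = \\arg\\min_{m} \\mathrm{MSE}_{\\mathrm{cv}}(m).$$\n",
    "\n",
    "Because the MSE is in squared cycles, its square root gives a typical error in cycles: the cross-validation MSE of about 1977.5 reported in the output below corresponds to roughly 44 cycles."
   ]
  },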
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Details\n",
      "=============\n",
      "H2OGeneralizedLinearEstimator :  Generalized Linear Model\n",
      "Model Key:  GLM_model_python_1445965974785_270\n",
      "\n",
      "GLM Model: summary\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td><b></b></td>\n",
       "<td><b>family</b></td>\n",
       "<td><b>link</b></td>\n",
       "<td><b>regularization</b></td>\n",
       "<td><b>number_of_predictors_total</b></td>\n",
       "<td><b>number_of_active_predictors</b></td>\n",
       "<td><b>number_of_iterations</b></td>\n",
       "<td><b>training_frame</b></td></tr>\n",
       "<tr><td></td>\n",
       "<td>gaussian</td>\n",
       "<td>identity</td>\n",
       "<td>Ridge ( lambda = 0.01 )</td>\n",
       "<td>17</td>\n",
       "<td>18</td>\n",
       "<td>1</td>\n",
       "<td>Key_Frame__http___h2o_public_test_data_s3_amazonaws_com_bigdata_laptop_CMAPSSData_train_FD001_preprocessed.hex</td></tr></table></div>"
      ],
      "text/plain": [
       "    family    link      regularization           number_of_predictors_total    number_of_active_predictors    number_of_iterations    training_frame\n",
       "--  --------  --------  -----------------------  ----------------------------  -----------------------------  ----------------------  --------------------------------------------------------------------------------------------------------------\n",
       "    gaussian  identity  Ridge ( lambda = 0.01 )  17                            18                             1                       Key_Frame__http___h2o_public_test_data_s3_amazonaws_com_bigdata_laptop_CMAPSSData_train_FD001_preprocessed.hex"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "ModelMetricsRegressionGLM: glm\n",
      "** Reported on train data. **\n",
      "\n",
      "MSE: 1907.65887465\n",
      "R^2: 0.597910247255\n",
      "Mean Residual Deviance: 1907.65887465\n",
      "Null degrees of freedom: 20630\n",
      "Residual degrees of freedom: 20613\n",
      "Null deviance: 97880908.3648\n",
      "Residual deviance: 39356910.2429\n",
      "AIC: 214425.224563\n",
      "\n",
      "ModelMetricsRegressionGLM: glm\n",
      "** Reported on cross-validation data. **\n",
      "\n",
      "MSE: 1977.53644453\n",
      "R^2: 0.583181694279\n",
      "Mean Residual Deviance: 1977.53644453\n",
      "Null degrees of freedom: 20630\n",
      "Residual degrees of freedom: 20613\n",
      "Null deviance: 98171005.0908\n",
      "Residual deviance: 40798554.387\n",
      "AIC: 215167.426437\n",
      "\n",
      "Scoring History:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td><b></b></td>\n",
       "<td><b>timestamp</b></td>\n",
       "<td><b>duration</b></td>\n",
       "<td><b>iteration</b></td>\n",
       "<td><b>log_likelihood</b></td>\n",
       "<td><b>objective</b></td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:46:38</td>\n",
       "<td> 0.000 sec</td>\n",
       "<td>0</td>\n",
       "<td>48955702.8</td>\n",
       "<td>2372.9</td></tr></table></div>"
      ],
      "text/plain": [
       "    timestamp            duration    iteration    log_likelihood    objective\n",
       "--  -------------------  ----------  -----------  ----------------  -----------\n",
       "    2015-10-27 16:46:38  0.000 sec   0            4.89557e+07       2372.92"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def extractBestModel(models):\n",
    "    bestMse = models[0].mse(xval=True)\n",
    "    result = models[0]\n",
    "    for model in models:\n",
    "        if model.mse(xval=True) < bestMse:\n",
    "            bestMse = model.mse(xval=True)\n",
    "            result = model\n",
    "    return result\n",
    "\n",
    "bestModel = extractBestModel(glmModels)\n",
    "bestModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build a series of GBM models using grid search for hyper-parameters\n",
    "\n",
    "Extract the 'best' model using the same approach as with GLM."
   ]
  },
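  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of what the three tuned hyper-parameters control, gradient boosting in its generic form builds the model additively (this is textbook notation, not H2O-specific):\n",
    "\n",
    "$$F_0(x) = \\bar{y}, \\qquad F_m(x) = F_{m-1}(x) + \\nu \\, h_m(x), \\quad m = 1, \\dots, M$$\n",
    "\n",
    "where each $h_m$ is a regression tree of depth at most `max_depth` fitted to the residuals $y - F_{m-1}(x)$ (the negative gradient of squared error), $\\nu$ is `learn_rate`, and $M$ is `ntrees`. A smaller $\\nu$ generally needs a larger $M$ to reach the same fit, which is one reason to search over both jointly."
   ]
  },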
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def trainGBM(x, y, fold_column, training_frame, learning_rate=0.1, ntrees=50, max_depth=5):\n",
    "    model = H2OGradientBoostingEstimator(distribution = \"gaussian\",\n",
    "                   learn_rate=learning_rate, ntrees=ntrees, max_depth=max_depth)\n",
    "    model.train(x=x, y=y, training_frame=training_frame, fold_column=fold_column)\n",
    "    return model\n",
    "\n",
    "def gridSearchGBM(x, y, fold_column, training_frame, learning_rates = [0.1,0.03,0.01], ntrees=[10,30,100,300], max_depth=[1,3,5]):\n",
    "    results = []\n",
    "    for learning_rate in learning_rates:\n",
    "        for ntree in ntrees:\n",
    "            for depth in max_depth:\n",
    "                print(\"GBM: {learning rate: \"+str(learning_rate)+\"},{ntrees: \"+str(ntree)+\"},{max_depth: \"+str(depth)+\"}\")\n",
    "                results.append( trainGBM(x, y, fold_column, training_frame, learning_rate=learning_rate, ntrees=ntree, max_depth=depth) )\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GBM: {learning rate: 0.03},{ntrees: 50},{max_depth: 2}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.03},{ntrees: 50},{max_depth: 5}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.03},{ntrees: 200},{max_depth: 2}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.03},{ntrees: 200},{max_depth: 5}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.01},{ntrees: 50},{max_depth: 2}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.01},{ntrees: 50},{max_depth: 5}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.01},{ntrees: 200},{max_depth: 2}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n",
      "GBM: {learning rate: 0.01},{ntrees: 200},{max_depth: 5}\n",
      "\n",
      "gbm Model Build Progress: [##################################################] 100%\n"
     ]
    }
   ],
   "source": [
    "if doGridSearch:\n",
    "    #gbmModels = gridSearchGBM(xCols, yCol, foldCol, train_hex,\\\n",
    "    #                        learning_rates=[0.03,0.01,0.003], ntrees=[100,300,500], max_depth=[1,3,5])\n",
    "\n",
    "    #run the below line for fast demo\n",
    "    gbmModels = gridSearchGBM(xCols, yCol, foldCol, train_hex, learning_rates=[0.03,0.01], ntrees=[50,200], max_depth=[2,5])\n",
    "else:\n",
    "    gbmModels = [trainGBM(xCols, yCol, foldCol, train_hex, \\\n",
    "                        ntrees=300, max_depth=5)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bestGbmModel = extractBestModel(gbmModels)"
   ]
  },
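  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the full grid, the best model had depth 5, learning rate 0.01, and 300 trees. The reduced fast-demo grid used above selects depth 5, learning rate 0.03, and 200 trees, as the parameters below show."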
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'balance_classes': {'actual': False, 'default': False},\n",
       " u'build_tree_one_node': {'actual': False, 'default': False},\n",
       " u'checkpoint': {'actual': None, 'default': None},\n",
       " u'class_sampling_factors': {'actual': None, 'default': None},\n",
       " u'col_sample_rate': {'actual': 1.0, 'default': 1.0},\n",
       " u'distribution': {'actual': u'gaussian', 'default': u'AUTO'},\n",
       " u'fold_assignment': {'actual': u'AUTO', 'default': u'AUTO'},\n",
       " u'fold_column': {'actual': {u'__meta': {u'schema_name': u'ColSpecifierV3',\n",
       "    u'schema_type': u'VecSpecifier',\n",
       "    u'schema_version': 3},\n",
       "   u'column_name': u'UnitNumberMod10',\n",
       "   u'is_member_of_frames': None},\n",
       "  'default': None},\n",
       " u'ignore_const_cols': {'actual': True, 'default': True},\n",
       " u'ignored_columns': {'actual': [u'SensorMeasure21',\n",
       "   u'SensorMeasure20',\n",
       "   u'SensorMeasure8',\n",
       "   u'SensorMeasure9',\n",
       "   u'SensorMeasure4',\n",
       "   u'SensorMeasure5',\n",
       "   u'SensorMeasure6',\n",
       "   u'SensorMeasure7',\n",
       "   u'SensorMeasure1',\n",
       "   u'SensorMeasure2',\n",
       "   u'SensorMeasure3',\n",
       "   u'SensorMeasure16',\n",
       "   u'SensorMeasure17',\n",
       "   u'SensorMeasure14',\n",
       "   u'SensorMeasure15',\n",
       "   u'SensorMeasure12',\n",
       "   u'SensorMeasure13',\n",
       "   u'SensorMeasure10',\n",
       "   u'SensorMeasure11',\n",
       "   u'SensorMeasure18',\n",
       "   u'SensorMeasure19',\n",
       "   u'UnitNumber',\n",
       "   u'Cycle'],\n",
       "  'default': None},\n",
       " u'keep_cross_validation_predictions': {'actual': False, 'default': False},\n",
       " u'learn_rate': {'actual': 0.03, 'default': 0.1},\n",
       " u'max_after_balance_size': {'actual': 5.0, 'default': 5.0},\n",
       " u'max_confusion_matrix_size': {'actual': 20, 'default': 20},\n",
       " u'max_depth': {'actual': 5, 'default': 5},\n",
       " u'max_hit_ratio_k': {'actual': 10, 'default': 10},\n",
       " u'min_rows': {'actual': 10.0, 'default': 10.0},\n",
       " u'model_id': {'actual': None, 'default': None},\n",
       " u'nbins': {'actual': 20, 'default': 20},\n",
       " u'nbins_cats': {'actual': 1024, 'default': 1024},\n",
       " u'nbins_top_level': {'actual': 1024, 'default': 1024},\n",
       " u'nfolds': {'actual': 0, 'default': 0},\n",
       " u'ntrees': {'actual': 200, 'default': 50},\n",
       " u'offset_column': {'actual': None, 'default': None},\n",
       " u'r2_stopping': {'actual': 0.999999, 'default': 0.999999},\n",
       " u'response_column': {'actual': {u'__meta': {u'schema_name': u'ColSpecifierV3',\n",
       "    u'schema_type': u'VecSpecifier',\n",
       "    u'schema_version': 3},\n",
       "   u'column_name': u'RemainingUsefulLife',\n",
       "   u'is_member_of_frames': None},\n",
       "  'default': None},\n",
       " u'sample_rate': {'actual': 1.0, 'default': 1.0},\n",
       " u'score_each_iteration': {'actual': False, 'default': False},\n",
       " u'seed': {'actual': 676607941053184637L, 'default': -4954682849530948794L},\n",
       " u'training_frame': {'actual': {u'URL': u'/3/Frames/Key_Frame__http___h2o_public_test_data_s3_amazonaws_com_bigdata_laptop_CMAPSSData_train_FD001_preprocessed.hex',\n",
       "   u'__meta': {u'schema_name': u'FrameKeyV3',\n",
       "    u'schema_type': u'Key<Frame>',\n",
       "    u'schema_version': 3},\n",
       "   u'name': u'Key_Frame__http___h2o_public_test_data_s3_amazonaws_com_bigdata_laptop_CMAPSSData_train_FD001_preprocessed.hex',\n",
       "   u'type': u'Key<Frame>'},\n",
       "  'default': None},\n",
       " u'tweedie_power': {'actual': 1.5, 'default': 1.5},\n",
       " u'validation_frame': {'actual': None, 'default': None},\n",
       " u'weights_column': {'actual': None, 'default': None}}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bestGbmModel.params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the run shown here, the best GBM model reports a cross-validation MSE of about 1695, an improvement over the GLM's 1978."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Details\n",
      "=============\n",
      "H2OGradientBoostingEstimator :  Gradient Boosting Machine\n",
      "Model Key:  GBM_model_python_1445965974785_298\n",
      "\n",
      "Model Summary:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td><b></b></td>\n",
       "<td><b>number_of_trees</b></td>\n",
       "<td><b>model_size_in_bytes</b></td>\n",
       "<td><b>min_depth</b></td>\n",
       "<td><b>max_depth</b></td>\n",
       "<td><b>mean_depth</b></td>\n",
       "<td><b>min_leaves</b></td>\n",
       "<td><b>max_leaves</b></td>\n",
       "<td><b>mean_leaves</b></td></tr>\n",
       "<tr><td></td>\n",
       "<td>200.0</td>\n",
       "<td>79128.0</td>\n",
       "<td>5.0</td>\n",
       "<td>5.0</td>\n",
       "<td>5.0</td>\n",
       "<td>13.0</td>\n",
       "<td>32.0</td>\n",
       "<td>28.575</td></tr></table></div>"
      ],
      "text/plain": [
       "    number_of_trees    model_size_in_bytes    min_depth    max_depth    mean_depth    min_leaves    max_leaves    mean_leaves\n",
       "--  -----------------  ---------------------  -----------  -----------  ------------  ------------  ------------  -------------\n",
       "    200                79128                  5            5            5             13            32            28.575"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "ModelMetricsRegression: gbm\n",
      "** Reported on train data. **\n",
      "\n",
      "MSE: 1095.13677598\n",
      "R^2: 0.769170850551\n",
      "Mean Residual Deviance: 1095.13677598\n",
      "\n",
      "ModelMetricsRegression: gbm\n",
      "** Reported on cross-validation data. **\n",
      "\n",
      "MSE: 1694.88676263\n",
      "R^2: 0.64275761858\n",
      "Mean Residual Deviance: 1694.88676263\n",
      "\n",
      "Scoring History:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td><b></b></td>\n",
       "<td><b>timestamp</b></td>\n",
       "<td><b>duration</b></td>\n",
       "<td><b>number_of_trees</b></td>\n",
       "<td><b>training_MSE</b></td>\n",
       "<td><b>training_deviance</b></td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:41</td>\n",
       "<td> 1 min 15.434 sec</td>\n",
       "<td>1.0</td>\n",
       "<td>4559.7</td>\n",
       "<td>4559.7</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:41</td>\n",
       "<td> 1 min 15.465 sec</td>\n",
       "<td>2.0</td>\n",
       "<td>4385.4</td>\n",
       "<td>4385.4</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:41</td>\n",
       "<td> 1 min 15.491 sec</td>\n",
       "<td>3.0</td>\n",
       "<td>4222.0</td>\n",
       "<td>4222.0</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:41</td>\n",
       "<td> 1 min 15.517 sec</td>\n",
       "<td>4.0</td>\n",
       "<td>4068.1</td>\n",
       "<td>4068.1</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:41</td>\n",
       "<td> 1 min 15.543 sec</td>\n",
       "<td>5.0</td>\n",
       "<td>3923.3</td>\n",
       "<td>3923.3</td></tr>\n",
       "<tr><td>---</td>\n",
       "<td>---</td>\n",
       "<td>---</td>\n",
       "<td>---</td>\n",
       "<td>---</td>\n",
       "<td>---</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:44</td>\n",
       "<td> 1 min 19.311 sec</td>\n",
       "<td>143.0</td>\n",
       "<td>1166.6</td>\n",
       "<td>1166.6</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:44</td>\n",
       "<td> 1 min 19.338 sec</td>\n",
       "<td>144.0</td>\n",
       "<td>1165.3</td>\n",
       "<td>1165.3</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:44</td>\n",
       "<td> 1 min 19.365 sec</td>\n",
       "<td>145.0</td>\n",
       "<td>1163.8</td>\n",
       "<td>1163.8</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:44</td>\n",
       "<td> 1 min 19.393 sec</td>\n",
       "<td>146.0</td>\n",
       "<td>1163.0</td>\n",
       "<td>1163.0</td></tr>\n",
       "<tr><td></td>\n",
       "<td>2015-10-27 16:49:46</td>\n",
       "<td> 1 min 20.789 sec</td>\n",
       "<td>200.0</td>\n",
       "<td>1095.1</td>\n",
       "<td>1095.1</td></tr></table></div>"
      ],
      "text/plain": [
       "     timestamp            duration          number_of_trees    training_MSE    training_deviance\n",
       "---  -------------------  ----------------  -----------------  --------------  -------------------\n",
       "     2015-10-27 16:49:41  1 min 15.434 sec  1.0                4559.66862057   4559.66862057\n",
       "     2015-10-27 16:49:41  1 min 15.465 sec  2.0                4385.44650064   4385.44650064\n",
       "     2015-10-27 16:49:41  1 min 15.491 sec  3.0                4221.99784872   4221.99784872\n",
       "     2015-10-27 16:49:41  1 min 15.517 sec  4.0                4068.11084731   4068.11084731\n",
       "     2015-10-27 16:49:41  1 min 15.543 sec  5.0                3923.31298583   3923.31298583\n",
       "---  ---                  ---               ---                ---             ---\n",
       "     2015-10-27 16:49:44  1 min 19.311 sec  143.0              1166.64947559   1166.64947559\n",
       "     2015-10-27 16:49:44  1 min 19.338 sec  144.0              1165.315081     1165.315081\n",
       "     2015-10-27 16:49:44  1 min 19.365 sec  145.0              1163.82517565   1163.82517565\n",
       "     2015-10-27 16:49:44  1 min 19.393 sec  146.0              1163.04011016   1163.04011016\n",
       "     2015-10-27 16:49:46  1 min 20.789 sec  200.0              1095.13677598   1095.13677598"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Variable Importances:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td><b>variable</b></td>\n",
       "<td><b>relative_importance</b></td>\n",
       "<td><b>scaled_importance</b></td>\n",
       "<td><b>percentage</b></td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean4</td>\n",
       "<td>709743360.0</td>\n",
       "<td>1.0</td>\n",
       "<td>0.6</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean3</td>\n",
       "<td>172408064.0</td>\n",
       "<td>0.2</td>\n",
       "<td>0.1</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean9</td>\n",
       "<td>126265464.0</td>\n",
       "<td>0.2</td>\n",
       "<td>0.1</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean14</td>\n",
       "<td>50092948.0</td>\n",
       "<td>0.1</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean6</td>\n",
       "<td>44630596.0</td>\n",
       "<td>0.1</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean11</td>\n",
       "<td>30628940.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean17</td>\n",
       "<td>28122880.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean21</td>\n",
       "<td>25222878.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean2</td>\n",
       "<td>20427146.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean7</td>\n",
       "<td>17334488.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean20</td>\n",
       "<td>17059280.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean12</td>\n",
       "<td>13289842.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean8</td>\n",
       "<td>7374711.5</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean15</td>\n",
       "<td>5707966.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>SensorMeasureKalmanMean13</td>\n",
       "<td>5684577.5</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>OpSet1</td>\n",
       "<td>242252.4</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr>\n",
       "<tr><td>OpSet2</td>\n",
       "<td>170719.0</td>\n",
       "<td>0.0</td>\n",
       "<td>0.0</td></tr></table></div>"
      ],
      "text/plain": [
       "variable                   relative_importance    scaled_importance    percentage\n",
       "-------------------------  ---------------------  -------------------  ------------\n",
       "SensorMeasureKalmanMean4   7.09743e+08            1                    0.556921\n",
       "SensorMeasureKalmanMean3   1.72408e+08            0.242916             0.135285\n",
       "SensorMeasureKalmanMean9   1.26265e+08            0.177903             0.0990779\n",
       "SensorMeasureKalmanMean14  5.00929e+07            0.070579             0.0393069\n",
       "SensorMeasureKalmanMean6   4.46306e+07            0.0628827            0.0350207\n",
       "SensorMeasureKalmanMean11  3.06289e+07            0.043155             0.0240339\n",
       "SensorMeasureKalmanMean17  2.81229e+07            0.039624             0.0220674\n",
       "SensorMeasureKalmanMean21  2.52229e+07            0.035538             0.0197919\n",
       "SensorMeasureKalmanMean2   2.04271e+07            0.028781             0.0160288\n",
       "SensorMeasureKalmanMean7   1.73345e+07            0.0244236            0.013602\n",
       "SensorMeasureKalmanMean20  1.70593e+07            0.0240358            0.0133861\n",
       "SensorMeasureKalmanMean12  1.32898e+07            0.0187249            0.0104283\n",
       "SensorMeasureKalmanMean8   7.37471e+06            0.0103907            0.00578678\n",
       "SensorMeasureKalmanMean15  5.70797e+06            0.0080423            0.00447892\n",
       "SensorMeasureKalmanMean13  5.68458e+06            0.00800934           0.00446057\n",
       "OpSet1                     242252                 0.000341324          0.00019009\n",
       "OpSet2                     170719                 0.000240536          0.00013396"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bestGbmModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploratory model analysis\n",
    "\n",
    "See how well the models do predicting on the training set.  Should be pretty good, but often worth a check.  \n",
    "\n",
    "Predictions are an ensemble of the 10-fold cross validation models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train_hex[\"weights\"] = 1\n",
    "allModels = bestGbmModel.xvals\n",
    "pred = sum([model.predict(train_hex) for model in allModels]) / len(allModels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pred[\"actual\"] = train_hex[\"RemainingUsefulLife\"]\n",
    "pred[\"unit\"] = train_hex[\"UnitNumber\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot actual remaining useful life vs predicted remaining useful life\n",
    "\n",
    "Ideally all points would be on the diagonal, indication prediction from data matched exactly the actual.\n",
    "\n",
    "Also, it is important that the prediction gets more accurate the closer it gets to no useful life remaining.\n",
    "\n",
    "Looking at a sample of the first 12 units.\n",
    "\n",
    "Moved predictions from H2O to Python Pandas for plotting using Seaborn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "scored_df = pred.as_data_frame(use_pandas=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sns.set_context(\"notebook\", font_scale=3)\n",
    "g=sns.lmplot(x=\"actual\",y=\"predict\",hue=\"unit\",col=\"unit\",data=scored_df[scored_df.unit < 13],col_wrap=3,fit_reg=False, size=10)\n",
    "\n",
    "ticks = np.linspace(-300,100, 5)\n",
    "\n",
    "g = (g.set_axis_labels(\"Remaining Useful Life\", \"Predicted Useful Life\")\n",
    "      .set(xlim=(-325, 125), ylim=(-325, 125),\n",
    "           xticks=ticks, yticks=ticks))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-300., -200., -100.,    0.,  100.])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.linspace(-300,100, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model prediction and assessment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict on the hold-out test set, using an average of all the cross validation models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "testPreds = sum([model.predict(test_hex) for model in allModels]) / len(allModels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Append the original index information (Cycle and UnitNumber) to the predicted values so we have them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "testPreds[\"Cycle\"] = test_hex[\"Cycle\"]\n",
    "testPreds[\"UnitNumber\"] = test_hex[\"UnitNumber\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Move the predictions over to Python Pandas for final analysis and scoring"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "testPreds_df = testPreds.as_data_frame(use_pandas=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load up the actual Remaining Useful Life information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if doKalmanSmoothing:\n",
    "    actual_RUL = pd.read_csv(_locate(\"rul_FD001_preprocessed.csv\"))\n",
    "else:\n",
    "    actual_RUL = pd.read_csv(\"http://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/CMAPSSData/rul_FD001_preprocessed.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The final scoring used in the competition is based on a single value per unit.  We extract the last three predictions and use the mean of those (simple aggregation) and put the prediction back from remaining useful life in T-minus format to cycles remaining (positive)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def aggfunc(x):\n",
    "    return np.mean( x.order().tail(3) )\n",
    "\n",
    "grouped_by_unit_preds = testPreds_df.groupby(\"UnitNumber\", as_index=False)\n",
    "predictedRUL = grouped_by_unit_preds.agg({'predict' : aggfunc })\n",
    "predictedRUL.predict = -predictedRUL.predict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add the prediction to the actual data frame, and use the scoring used in the PHMO8 competition (more penality for predicting more useful life than there is actual)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "final = pd.concat([actual_RUL, predictedRUL.predict], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def rowScore(row):\n",
    "    d = row.predict-row.RemainingUsefulLife\n",
    "    return np.exp( -d/10 )-1 if d < 0 else np.exp(d/13)-1\n",
    "\n",
    "rowScores = final.apply(rowScore, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the final score using PHM08 method of scoring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1174.2997365847225"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum(rowScores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finally look at the actual remaining useful life and compare to predicted\n",
    "\n",
    "Some things that should ideally would be true:\n",
    "- As RUL gets closer to zero, the prediction gets closer to actual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sns.set_context(\"notebook\", font_scale=1.25)\n",
    "sns.regplot(\"RemainingUsefulLife\", \"predict\", data=final, fit_reg=False);"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
